Introduction

Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has been automated. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500 thousand bicycles. Today there is great interest in these systems due to their important role in traffic, environmental, and health issues.

The bike-sharing rental process is highly correlated with environmental and seasonal conditions. For instance, weather, precipitation, day of the week, season, and hour of the day can all affect rental behavior. The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system in Washington, D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data and can also be downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset. The data was aggregated on a daily basis, and the corresponding weather and seasonal information was extracted and added.


The dataset consists of the following variables:

- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
- weathersit :
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
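Because the continuous variables are stored in normalized form, it can be useful to convert them back to raw units. A minimal sketch using the maxima stated above (the denormalize helper is our own, not part of the dataset):

```r
# Recover raw units from a normalized column, given the documented maximum.
denormalize <- function(x, max_val) x * max_val

# Example: normalized temperatures back to degrees Celsius (max = 41).
temp_norm <- c(0.344, 0.363, 0.196)
denormalize(temp_norm, 41)  # 14.104 14.883 8.036
```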


We will use this dataset to predict the count of rental bikes (the cnt variable) based on the features described above.




The purpose of this work is to understand how the variables in the bike-sharing dataset combine in complex relationships to predict the demand for bikes, and how attributes like weather and time affect the quantitative outcome of bike rentals. Quantifying these relationships will assist in forecasting short-term demand for bike rentals.



Method


Load and Clean the Data

The data consists of CSV files containing daily and hourly summaries of bike rentals with the corresponding weather. We decided to focus on the daily data.

data = read.csv("day.csv")
str(data)
## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...


In the imported data the only categorical feature is the date, which we will remove since it is unique to each row. Some of the features should be categorical, so we will convert them to factors and assign meaningful levels.

data$season = as.factor(data$season)
levels(data$season) <- c("spring", "summer", "fall", "winter") # load it as a factor with levels

data$holiday = as.factor(data$holiday)
levels(data$holiday) = c("no", "yes") # load it as a factor with levels

data$workingday = as.factor(data$workingday)
levels(data$workingday) = c("no", "yes") # load it as a factor with levels

data$weathersit = as.factor(data$weathersit)
levels(data$weathersit) = c("Clearish", "Misty", "LightPrecip") # load it as a factor with levels

data$weekday = as.factor(data$weekday)
levels(data$weekday) = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat") # load it as a factor with levels

data$mnth = as.factor(data$mnth)
levels(data$mnth) = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec") # load it as a factor with levels

str(data) # look at the structure of transformed data
## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : Factor w/ 4 levels "spring","summer",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday   : Factor w/ 7 levels "Sun","Mon","Tue",..: 7 1 2 3 4 5 6 7 1 2 ...
##  $ workingday: Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 1 1 2 ...
##  $ weathersit: Factor w/ 3 levels "Clearish","Misty",..: 2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...

We should note that although the documentation specifies four levels for weathersit, only three of those levels actually appear in the data.


Check for Missing Data

sum(is.na(data))
## [1] 0

There are no fields missing in the data.


Exploratory Data Analysis

Before beginning to model the data, we performed exploratory data analysis to gain insight into the distribution of the features and their relationships. As this analysis was not directly related to the choice of a model, it is included in the Appendix.

The exploratory analysis looked at the distributions of individual features, relationships between features and the response, correlation between features, and tried to find interactions between features as related to the response.


Modeling

First we remove features that will not be used in our modeling. instant and dteday are unique to each row, and we also remove casual and registered since we are modeling total ridership.

data=data[,c(-1,-2,-14,-15)] # remove instant, dteday, casual, and registered since they are not needed for our model.

We also create a function to calculate LOOCV-RMSE as we will be using this to evaluate our models.

# function to calculate leave-one-out cross-validated RMSE
calc_loocv_rmse = function(model) {
  temp=(resid(model) / (1 - hatvalues(model)))^2
  temp=temp[is.finite(temp)]
  sqrt(mean(temp))
}
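The function relies on the identity that, for a linear model, the leave-one-out residual for observation \(i\) equals \(e_i / (1 - h_i)\), where \(h_i\) is the hat value. A quick sanity check on synthetic data (unrelated to the bike data) confirms the shortcut matches explicit refitting:

```r
# Compare the hat-value shortcut to explicit leave-one-out refitting.
set.seed(42)
toy <- data.frame(x = 1:20)
toy$y <- toy$x + rnorm(20)
fit <- lm(y ~ x, data = toy)

shortcut <- sqrt(mean((resid(fit) / (1 - hatvalues(fit)))^2))

errs <- sapply(seq_len(nrow(toy)), function(i) {
  f <- lm(y ~ x, data = toy[-i, ])
  toy$y[i] - predict(f, newdata = toy[i, , drop = FALSE])
})
explicit <- sqrt(mean(errs^2))

all.equal(shortcut, explicit)  # TRUE
```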

Now we will separate numeric and categorical features.

numerical <- unlist(lapply(data, is.numeric)) # logical vector marking, for each column, whether it is numeric


Numeric Predictors Only

We begin by creating a model using only the numerical predictors to model the target. Diagnostics for this model are in Figure 1.

data_numerical= data[, numerical] # get the target and all the numerical columns
bike_mod_num = lm(cnt ~ ., data = data_numerical) # model with all numerical variables
summary(bike_mod_num)[["adj.r.squared"]] # get the adjusted r-squared 
## [1] 0.7261027
calc_loocv_rmse(bike_mod_num) # get the loocv rmse
## [1] 1041.521

This model has a rather low adjusted \(R^2\) and a quite high cross-validated RMSE, which indicates that the model is not explaining the response as well as it could.

We now examine the p-values for the coefficients of the model:

summary(bike_mod_num)$coefficients[, 'Pr(>|t|)']
##   (Intercept)            yr          temp         atemp           hum 
##  7.548072e-19 6.262707e-109  2.937966e-01  4.505973e-03  1.193052e-15 
##     windspeed 
##  7.785849e-15

The above results show that temp is not a very significant predictor, as it has a high p-value. This can be attributed to the fact that temp and atemp are highly correlated, as we saw in the EDA, so the effect of one is absorbed by the other.
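The p-value inflation caused by collinear predictors is easy to reproduce on synthetic data (a sketch unrelated to the bike data): two nearly identical predictors share one signal, so the joint fit cannot attribute it to either one with confidence.

```r
set.seed(3)
t1 <- rnorm(200)
t2 <- t1 + rnorm(200, sd = 0.05)  # nearly a copy of t1, like temp vs atemp
y  <- 2 * t1 + rnorm(200)

# Joint fit: the standard error on t1 is inflated by collinearity with t2.
se_joint  <- summary(lm(y ~ t1 + t2))$coefficients["t1", "Std. Error"]
# Fit with t1 alone: a much smaller standard error.
se_single <- summary(lm(y ~ t1))$coefficients["t1", "Std. Error"]

se_joint > se_single  # TRUE
```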

par(mfrow = c(2, 2))
plot(bike_mod_num,col = 'dodgerblue') # do diagnostics

The Residuals vs Fitted plot shows a slight non-linear trend, and the Residuals vs Leverage plot also highlights some outliers.


Numeric and Categorical Predictors

bike_mod_all=lm(cnt~., data=data) # model with all the variables
summary(bike_mod_all)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8423198
calc_loocv_rmse(bike_mod_all) # get the loocv rmse
## [1] 794.7767

Including the categorical features has substantially improved the adjusted \(R^2\) and lowered the LOOCV-RMSE.

P-values for coefficients for the model:

summary(bike_mod_all)$coefficients[, 'Pr(>|t|)'] # get the coefficients
##           (Intercept)          seasonsummer            seasonfall 
##          9.774832e-10          1.032402e-06          1.024766e-04 
##          seasonwinter                    yr               mnthFeb 
##          2.279871e-17         2.294966e-154          3.624432e-01 
##               mnthMar               mnthApr               mnthMay 
##          1.084575e-03          6.881974e-02          6.145473e-03 
##               mnthJun               mnthJul               mnthAug 
##          6.842301e-02          9.218536e-01          1.426395e-01 
##               mnthSep               mnthOct               mnthNov 
##          1.651562e-04          3.178651e-02          6.132572e-01 
##               mnthDec            holidayyes            weekdayMon 
##          6.230982e-01          1.129889e-03          5.318673e-02 
##            weekdayTue            weekdayWed            weekdayThu 
##          3.981733e-03          4.136266e-04          3.498183e-04 
##            weekdayFri            weekdaySat       weathersitMisty 
##          5.296026e-05          4.007227e-05          3.156422e-09 
## weathersitLightPrecip                  temp                 atemp 
##          5.386219e-22          4.152649e-02          2.222607e-01 
##                   hum             windspeed 
##          2.013660e-07          2.091887e-11

The p-values indicate that not all of the variables are significant.

Diagnostic plots for the model:

par(mfrow = c(2, 2))
plot(bike_mod_all,col = 'dodgerblue') # do diagnostics

We still see some issues in the diagnostic plots.

  • The Residuals vs Fitted plot shows a non-linear trend, and we also see some extreme outliers.
  • We see somewhat fat tails in the Normal Q-Q plot.
  • The Residuals vs Leverage plot also indicates the presence of some outliers, which we may need to examine later in the analysis.

Based upon the results, we want to examine the significance of several of the categorical variables:

  • Month
  • Week Day
  • Working Day
bike_mod_w_month=lm(cnt~.-mnth, data=data) # model without month
bike_mod_w_weekday=lm(cnt~.-weekday, data=data) # model without weekday
bike_mod_w_workingday=lm(cnt~.-workingday, data=data) # model without workingday
anova(bike_mod_w_month,bike_mod_w_weekday,bike_mod_w_workingday,bike_mod_all) # do anova test
## Analysis of Variance Table
## 
## Model 1: cnt ~ (season + yr + mnth + holiday + weekday + workingday + 
##     weathersit + temp + atemp + hum + windspeed) - mnth
## Model 2: cnt ~ (season + yr + mnth + holiday + weekday + workingday + 
##     weathersit + temp + atemp + hum + windspeed) - weekday
## Model 3: cnt ~ (season + yr + mnth + holiday + weekday + workingday + 
##     weathersit + temp + atemp + hum + windspeed) - workingday
## Model 4: cnt ~ season + yr + mnth + holiday + weekday + workingday + weathersit + 
##     temp + atemp + hum + windspeed
##   Res.Df       RSS Df Sum of Sq       F    Pr(>F)    
## 1    713 472452877                                   
## 2    707 428442679  6  44010198 12.3957 2.683e-13 ***
## 3    702 415401688  5  13040991  4.4077 0.0005852 ***
## 4    702 415401688  0         0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test shows that month and weekday are significant predictors, so we cannot rule them out. Even though the month variable is statistically significant overall, it may be that only a few of its levels are useful while the rest do not help. We will use model selection schemes later to investigate that.

According to the test results, the working day variable appears to be non-significant, so we can rule out this variable.
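The Df of 0 and unchanged RSS in the last row of the ANOVA table make sense: workingday is fully determined by weekday and holiday, so it carries no new information once both are in the model. A minimal sketch of that relationship, assuming the original integer codings:

```r
# workingday = 1 iff the day is neither a weekend nor a holiday.
weekday <- c(0, 3, 6, 2)  # 0 = Sunday, ..., 6 = Saturday
holiday <- c(0, 0, 0, 1)
workingday <- as.integer(weekday %in% 1:5 & holiday == 0)
workingday  # 0 1 0 0
```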

We will now re-fit the model excluding working day:

data_2= data[,c(-6)] # remove working day variable
bike_mod_all_2=lm(cnt~., data=data_2) # model with all remaining variable
summary(bike_mod_all_2)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8423198
calc_loocv_rmse(bike_mod_all_2) # get the loocv rmse
## [1] 794.7767

Excluding working day had no effect on the adjusted \(R^2\) of the model, so we can safely remove that feature.


Identifying Collinearity

Now we will use the Variance Inflation Factor (VIF) to determine whether we have multicollinearity issues:

library(car) # vif() comes from the car package
vif(bike_mod_all_2)
##          seasonsummer            seasonfall          seasonwinter 
##              7.496334             10.720030              7.455172 
##                    yr               mnthFeb               mnthMar 
##              1.046828              1.835947              2.624298 
##               mnthApr               mnthMay               mnthJun 
##              5.704404              6.868018              7.423137 
##               mnthJul               mnthAug               mnthSep 
##              9.443506              8.813082              6.542133 
##               mnthOct               mnthNov               mnthDec 
##              5.594951              4.956867              3.183722 
##            holidayyes            weekdayMon            weekdayTue 
##              1.121296              1.821782              1.730242 
##            weekdayWed            weekdayThu            weekdayFri 
##              1.741363              1.743011              1.740119 
##            weekdaySat       weathersitMisty weathersitLightPrecip 
##              1.725530              1.642310              1.338411 
##                  temp                 atemp                   hum 
##             80.806689             70.036722              2.140362 
##             windspeed 
##              1.273296
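As a reminder, the VIF of a predictor is \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. A hand-rolled check on synthetic data (not the bike data):

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # nearly collinear with x1
x3 <- rnorm(100)

r2_x1 <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1 > 10  # TRUE: x1 is severely collinear with the other predictors
```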

As we suspected, temp and atemp exhibit a high level of collinearity. We will now look at the partial correlation coefficient between temp and cnt:

temp_model_1 <- lm(temp ~ . - cnt, data = data_2)
temp_model_2 <- lm(cnt ~ . - temp, data = data_2)

cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.0768418

While this is relatively small, as temp and atemp are highly correlated we should check the partial correlation coefficient after removing atemp:

temp_model_1 <- lm(temp ~ . - cnt - atemp, data = data_2)
temp_model_2 <- lm(cnt ~ . - temp - atemp, data = data_2)

cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.3801001

Now we will check the partial correlation coefficient between atemp and cnt, removing temp:

temp_model_1 <- lm(atemp ~ . - cnt - temp, data = data_2)
temp_model_2 <- lm(cnt ~ . - atemp - temp, data = data_2)

cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.3757927

These results indicate that temp is more correlated with cnt than atemp is, so we will remove atemp and keep temp:

data_3= data_2[,-8] # remove atemp
bike_mod_all_3=lm(cnt~., data=data_3)
summary(bike_mod_all_3)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8422094
calc_loocv_rmse(bike_mod_all_3) # get the loocv rmse
## [1] 789.3204

This change slightly lowered the adjusted \(R^2\), as we would expect from removing a predictor, but improved the LOOCV-RMSE, which indicates that it has improved the model.

As we have concerns about the year predictor, we will also look at the partial correlation coefficient between year and count:

yr_mod_0 <- lm(cnt ~ . - yr, data = data_3)
yr_mod_1 <- lm(yr ~ . - cnt, data = data_3)

cor(resid(yr_mod_0), resid(yr_mod_1))
## [1] 0.7942529

This is quite high, which indicates that year has a significant relationship with ridership. We have seen that ridership appears to increase from year to year (with seasonal cycles within that trend). Since our data covers only two years, this may cause problems when using the model to extrapolate to years not included in the training set. However, the high partial correlation coefficient indicates that year is an important predictor, so we will leave it in the model.


Outlier Diagnostics

Next we want to check for potential outliers. We have three ways of doing so:

  • Leverage
  • Standardized residuals
  • Cook's distance

We will use Cook's distance to identify such outliers and examine their effect on the model.
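All three diagnostics can be read off a fitted lm object; a toy illustration on synthetic data with one deliberately extreme point:

```r
set.seed(7)
d <- data.frame(x = c(1:19, 40))
d$y <- c(1:19 + rnorm(19), 80)  # last point lies far from the y = x trend
m <- lm(y ~ x, data = d)

hatvalues(m)       # leverage
rstandard(m)       # standardized residuals
cooks.distance(m)  # Cook's distance

# The rule-of-thumb cutoff used below: 4 / n
which(cooks.distance(m) > 4 / nrow(d))  # row 20 is among the flagged points
```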

First we will calculate the number of observations flagged by Cook's distance:

sum(cooks.distance(bike_mod_all_3) > 4 / length(cooks.distance(bike_mod_all_3)))
## [1] 44

Now we will refit a model excluding the identified observations:

cokks_distance = cooks.distance(bike_mod_all_3)
bike_mod_all_4 = lm(cnt ~.,
                    data = data_3,
                    subset = cokks_distance <= 4 / length(cokks_distance))
summary(bike_mod_all_4)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9085799
calc_loocv_rmse(bike_mod_all_4) # get the loocv rmse
## [1] 586.5534

Removing these outliers resulted in a substantial increase in adjusted \(R^2\) and a substantial decrease in LOOCV-RMSE.

par(mfrow = c(2, 2))
plot(bike_mod_all_4,col = 'dodgerblue') # do diagnostics

In the Residuals vs Fitted plot the residuals start to look better behaved, although we still see some non-linear patterns, which indicates that we may want to try higher-order terms or transformations.

The Residuals vs Leverage plot looks much neater, and the Normal Q-Q plot has also improved.


Interactions

We will now evaluate including interactions:

bike_mod_all_5 = lm(cnt ~.^2,
                    data = data_3,
                    subset = cokks_distance <= 4 / length(cokks_distance))
summary(bike_mod_all_5)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9466695

Including all possible interactions resulted in a substantial improvement in adjusted \(R^2\); however, we cannot be sure whether this improvement reflects a genuinely better model or merely the inclusion of additional predictors.

calc_loocv_rmse(bike_mod_all_5) # get the loocv rmse
## [1] 759.5543

The LOOCV-RMSE has increased, which indicates that the additional terms may have overfit the data.

par(mfrow = c(2, 2))
plot(bike_mod_all_5,col = 'dodgerblue') # do diagnostics

length(coef(bike_mod_all_5)) # get the number of params
## [1] 305

The Residuals vs Fitted plot looks random, with errors scattered around the 0 line, though some outliers are still highlighted on the plot.

The Q-Q plot looks more normal.

The leverage plot shows some indication of potential new outliers.


Model Selection

Since the model including all possible interactions increased the LOOCV-RMSE, indicating that it was overfitting the training data, we will perform a backward AIC step search to remove the non-significant terms.

bike_mod_all_6=step(bike_mod_all_5, trace=0, direction="backward")
length(coef(bike_mod_all_6)) # get the no of params
## [1] 117
summary(bike_mod_all_6)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9481845

While reducing the number of predictors from 305 to 117, the backward step search also produced a very small improvement in adjusted \(R^2\).

calc_loocv_rmse(bike_mod_all_6) # get the loocv rmse
## [1] 470.9009

The LOOCV-RMSE also improved significantly, which indicates that this smaller model is a much better model for prediction.

par(mfrow = c(2, 2))
plot(bike_mod_all_6,col = 'dodgerblue') # do diagnostics

The Residuals vs Fitted plot still looks reasonably random.


Polynomial Features

We have previously seen some issues which indicated that polynomial features might improve the model. We will now evaluate including them:

temp_mod = lm(cnt ~ .+I(temp^2)+I(hum^2)+I(windspeed^2),
                    data = data_3,
                    subset = cokks_distance <= 4 / length(cokks_distance))
bike_mod_all_7=step(temp_mod, trace=0, direction="backward")
summary(bike_mod_all_7)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9203027

The adjusted \(R^2\) is lower than it was for our interaction model.

calc_loocv_rmse(bike_mod_all_7) # get the loocv rmse
## [1] 548.4574

The LOOCV-RMSE has also increased.

par(mfrow = c(2, 2))
plot(bike_mod_all_7,col = 'dodgerblue') # do diagnostics

The Residuals vs Fitted plot does not look random and shows some non-linear pattern, which indicates that the inclusion of these terms has not improved the model. However, \(temp^2\) has a very low p-value, which suggests it may still be useful. We will keep this in mind.


Transformations

We will now evaluate taking a log transformation of the response.

temp_m = lm(log(cnt) ~.^2,
                    data = data_3,
                    subset = cokks_distance <= 4 / length(cokks_distance))
bike_mod_all_8=step(temp_m, trace=0, direction="backward")
summary(bike_mod_all_8)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9378731

The adjusted \(R^2\) is lower with this transformation (and, strictly speaking, \(R^2\) values are not directly comparable when the response has been transformed).

par(mfrow = c(2, 2))
plot(bike_mod_all_8,col = 'dodgerblue') # do diagnostics

In addition we see issues in both the Residuals vs Fitted plot and the Normal Q-Q plot. We can conclude that this transformation was not helpful.


Poisson Regression

Since the target is a count variable, we will evaluate a Poisson regression.

poisson_mod = glm(cnt ~ (. ^2),
                    data = data_3,
                    subset = cokks_distance <= 4 / length(cokks_distance), family=poisson)

bike_mod_all_9 = step(poisson_mod, trace=0, direction="backward")

Since Poisson regression models the log of the expected count, we will need to do a bit of work to estimate the LOOCV-RMSE on the response scale:

filter = cokks_distance <= 4 / length(cokks_distance)
e <- data_3$cnt[filter] - bike_mod_all_9$fitted.values
temp=(e / (1 - hatvalues(bike_mod_all_9)))^2
temp=temp[is.finite(temp)]
sqrt(mean(temp))
## [1] 629.2086

While we are not confident that this LOOCV-RMSE value is exactly right given the link transformation used by Poisson GLMs, the value is in line with those produced by earlier models and is significantly higher than that of our best models.

par(mfrow = c(2, 2))
plot(bike_mod_all_9,col = 'dodgerblue') # do diagnostics

While we see some issues in the lower tail of the Normal Q-Q plot, the Residuals vs Fitted plot indicates no issues.

Due to the lack of directly comparable metrics between the Poisson regressor and our other models, we will continue with the non-GLM models.


Interactions with Polynomial Terms

Finally, since some of the polynomial terms seemed as if they could be significant, we will try a model which includes interactions and polynomial terms:

bike_mod_int_poly = lm(cnt ~(. ^ 2) + I(temp^2) + I(hum^2) + I(windspeed^2),
                    data = data_3,
                    subset = cokks_distance <= 4 / length(cokks_distance))
bike_mod_all_10 = step(bike_mod_int_poly, trace=0, direction="backward")
summary(bike_mod_all_10)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9497541

The adjusted \(R^2\) is the best we have found yet.

calc_loocv_rmse(bike_mod_all_10) # get the loocv rmse
## [1] 461.4496

And this model also results in the lowest LOOCV-RMSE.

Finally, we will refit the model using the full data set, find the influential observations (by Cook's distance), and refit the model excluding them:

bike_mod_int_poly_full = lm(cnt ~(. ^ 2) + I(temp^2) + I(hum^2) + I(windspeed^2),
                    data = data_3)
bike_mod_all_11 = step(bike_mod_int_poly_full, trace=0, direction="backward")

cooks_distance = cooks.distance(bike_mod_all_11) # find influential observations and exclude them
filter = cooks_distance <= (4 / length(cooks_distance))
# some points have a Cook's distance of NA; we will treat these as outliers
filter[is.na(filter)] <- FALSE

bike_mod_int_poly_full = lm(cnt ~(. ^ 2) + I(temp^2) + I(hum^2) + I(windspeed^2),
                    data = data_3,
                    subset = filter)
bike_mod_all_11 = step(bike_mod_int_poly_full, trace=0, direction="backward")
summary(bike_mod_all_11)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9581752
calc_loocv_rmse(bike_mod_all_11) # get the loocv rmse
## [1] 426.8036

Finding and removing the influential observations has improved both the LOOCV-RMSE and the adjusted \(R^2\).



Results

Our best model included both interactions and polynomial terms.

summary(bike_mod_all_11)$coef
##                                       Estimate   Std. Error     t value
## (Intercept)                       -1683.157857    628.16807 -2.67947056
## seasonsummer                       1007.282902    514.22911  1.95882123
## seasonfall                         5346.488321   1355.39766  3.94459021
## seasonwinter                       1875.157098    977.29943  1.91871297
## yr                                 1571.808081    290.91020  5.40306968
## mnthFeb                             -36.596132    426.16630 -0.08587289
## mnthMar                            -874.913934    487.62426 -1.79423792
## mnthApr                           -1649.057717    850.64501 -1.93859683
## mnthMay                            1594.909994   1098.35947  1.45208380
## mnthJun                            5073.201019   1479.97327  3.42790044
## mnthJul                            7675.642562   2280.34604  3.36599904
## mnthAug                             576.367356   2076.57684  0.27755648
## mnthSep                            -233.984044   1569.61641 -0.14907084
## mnthOct                            -248.871111   1256.54949 -0.19805914
## mnthNov                            -406.179652   1189.33609 -0.34151797
## mnthDec                            -947.636192    981.89091 -0.96511352
## holidayyes                        -4186.149391   1605.72061 -2.60702228
## weekdayMon                          205.959122    138.53667  1.48667586
## weekdayTue                          349.335237    136.33276  2.56237199
## weekdayWed                          404.042478    140.25066  2.88085977
## weekdayThu                          236.222443    139.94812  1.68792861
## weekdayFri                          430.350153    141.74698  3.03604465
## weekdaySat                          -16.410314    130.45377 -0.12579409
## weathersitMisty                     551.344014    350.45802  1.57320989
## weathersitLightPrecip            -14720.916520 155712.09794 -0.09453932
## temp                               7183.427951   1596.36499  4.49986562
## hum                                3899.353693   1665.27408  2.34156872
## windspeed                          8690.056734   1838.65179  4.72631999
## I(temp^2)                         -5972.663959   2799.15244 -2.13374015
## I(hum^2)                          -2106.785083   1306.74444 -1.61223956
## I(windspeed^2)                    -9860.844181   2280.57188 -4.32384714
## seasonsummer:yr                    1285.262946    264.17642  4.86516910
## seasonfall:yr                       533.058497    326.77930  1.63124928
## seasonwinter:yr                     659.955073    313.89887  2.10244491
## seasonsummer:holidayyes            -872.834260    771.16033 -1.13184538
## seasonfall:holidayyes             -1920.061232   1083.21910 -1.77255112
## seasonsummer:weekdayMon            -719.252826    174.75416 -4.11579812
## seasonfall:weekdayMon              -355.732004    172.02576 -2.06789963
## seasonwinter:weekdayMon            -242.071372    174.39260 -1.38808282
## seasonsummer:weekdayTue            -647.481007    170.66887 -3.79378494
## seasonfall:weekdayTue              -160.705969    165.72005 -0.96974365
## seasonwinter:weekdayTue               9.148958    172.05279  0.05317529
## seasonsummer:weekdayWed            -710.700517    165.60444 -4.29155482
## seasonfall:weekdayWed              -302.930356    169.48059 -1.78740442
## seasonwinter:weekdayWed            -209.656446    173.69328 -1.20704985
## seasonsummer:weekdayThu            -476.993276    167.14545 -2.85376173
## seasonfall:weekdayThu              -252.243035    166.94572 -1.51092845
## seasonwinter:weekdayThu            -107.387186    177.16770 -0.60613299
## seasonsummer:weekdayFri            -408.566665    170.47384 -2.39665308
## seasonfall:weekdayFri              -308.933488    165.82517 -1.86300718
## seasonwinter:weekdayFri            -229.443635    170.85439 -1.34291917
## seasonsummer:weekdaySat             231.030011    165.28014  1.39780862
## seasonfall:weekdaySat               263.169325    163.59916  1.60862268
## seasonwinter:weekdaySat             371.076020    164.70581  2.25296254
## seasonsummer:weathersitMisty       -722.271467    329.65371 -2.19100058
## seasonfall:weathersitMisty         -563.891641    386.74521 -1.45804426
## seasonwinter:weathersitMisty       -627.249044    330.50239 -1.89786538
## seasonfall:weathersitLightPrecip  -1223.564090  28423.50219 -0.04304762
## seasonsummer:temp                 -4982.990491   1674.57193 -2.97568017
## seasonfall:temp                  -10478.551306   2520.37862 -4.15753063
## seasonwinter:temp                 -4927.242674   1542.84491 -3.19360853
## seasonsummer:hum                   2510.784980   1053.21752  2.38391873
## seasonfall:hum                     1984.759437   1633.72187  1.21486985
## seasonwinter:hum                   2244.421218   1554.85212  1.44349497
## yr:mnthFeb                          -23.032973    176.95678 -0.13016157
## yr:mnthMar                          161.433894    214.44466  0.75279980
## yr:mnthApr                         -933.972945    348.39643 -2.68077645
## yr:mnthMay                        -1730.266988    378.14572 -4.57566195
## yr:mnthJun                        -1460.191848    392.56008 -3.71966464
## yr:mnthJul                         -906.094655    447.75498 -2.02363950
## yr:mnthAug                         -394.949341    430.67504 -0.91704721
## yr:mnthSep                         -336.952954    403.58612 -0.83489729
## yr:mnthOct                          -25.875650    382.07223 -0.06772450
## yr:mnthNov                         -519.130822    374.19104 -1.38734166
## yr:mnthDec                         -600.743332    317.58216 -1.89161550
## yr:weekdayMon                       447.472396    119.09175  3.75737534
## yr:weekdayTue                       357.617298    119.33159  2.99683673
## yr:weekdayWed                       427.096412    121.77112  3.50737041
## yr:weekdayThu                       655.717206    121.01662  5.41840634
## yr:weekdayFri                       525.159691    118.57263  4.42901264
## yr:weekdaySat                       583.689956    117.20206  4.98020202
## yr:weathersitMisty                 -289.958296     94.11826 -3.08078680
## yr:temp                            1712.948932    558.46150  3.06726412
## yr:hum                             -968.212312    371.60507 -2.60548734
## yr:windspeed                       -942.668377    487.55091 -1.93347683
## mnthFeb:weathersitMisty            -113.544874    210.17913 -0.54022907
## mnthMar:weathersitMisty            -165.056877    263.32604 -0.62681562
## mnthApr:weathersitMisty            -344.205723    419.60135 -0.82031604
## mnthMay:weathersitMisty             405.748885    428.55351  0.94678698
## mnthJun:weathersitMisty            -286.810348    446.77550 -0.64195631
## mnthJul:weathersitMisty            -330.613027    503.78410 -0.65625935
## mnthAug:weathersitMisty             -75.573267    490.25090 -0.15415222
## mnthSep:weathersitMisty            -262.344593    444.44444 -0.59027534
## mnthOct:weathersitMisty            -121.263680    406.18655 -0.29854184
## mnthNov:weathersitMisty             403.136037    387.55104  1.04021405
## mnthDec:weathersitMisty             544.194050    340.07966  1.60019580
## mnthOct:weathersitLightPrecip     -2160.039406  29281.33694 -0.07376847
## mnthFeb:temp                       -537.529140   1124.60346 -0.47797215
## mnthMar:temp                       3187.552028   1393.64365  2.28720738
## mnthApr:temp                       8776.005504   2373.96634  3.69676913
## mnthMay:temp                       7050.072966   2851.35326  2.47253578
## mnthJun:temp                       1241.808461   3251.55191  0.38191254
## mnthJul:temp                      -2171.924595   4051.26205 -0.53611062
## mnthAug:temp                       7227.527231   3819.78852  1.89212759
## mnthSep:temp                       9624.696390   3315.19895  2.90320325
## mnthOct:temp                       7343.714804   2337.15085  3.14216552
## mnthNov:temp                       4487.196466   2367.00697  1.89572592
## mnthDec:temp                       5835.861100   1806.21144  3.23099553
## mnthFeb:hum                         646.600316    723.38478  0.89385392
## mnthMar:hum                         382.660731    828.30867  0.46197842
## mnthApr:hum                       -1073.359121   1289.66474 -0.83227764
## mnthMay:hum                       -4150.273136   1335.75544 -3.10706063
## mnthJun:hum                       -3742.574846   1339.23320 -2.79456545
## mnthJul:hum                       -3628.858102   1832.90232 -1.97984261
## mnthAug:hum                       -3939.536157   1854.49777 -2.12431432
## mnthSep:hum                       -4590.272225   1874.44235 -2.44887351
## mnthOct:hum                       -2888.400625   1841.22892 -1.56873520
## mnthNov:hum                       -1784.941857   1767.63854 -1.00978895
## mnthDec:hum                       -1577.333641   1554.63469 -1.01460083
## holidayyes:hum                     6725.198944   3288.03915  2.04535245
## weekdayMon:weathersitMisty          200.222516    135.29996  1.47984162
## weekdayTue:weathersitMisty          -69.502245    135.04967 -0.51464211
## weekdayWed:weathersitMisty          -47.747137    135.91327 -0.35130593
## weekdayThu:weathersitMisty           29.179093    134.83998  0.21639794
## weekdayFri:weathersitMisty         -159.463510    136.16194 -1.17113126
## weekdaySat:weathersitMisty         -162.241594    137.11107 -1.18328590
## weekdayMon:weathersitLightPrecip   1433.004460   3176.29814  0.45115553
## weathersitMisty:temp               1859.348438    556.16869  3.34313756
## weathersitLightPrecip:temp        27925.347443 340979.06826  0.08189754
## weathersitMisty:hum               -1743.947562    543.30563 -3.20988312
## hum:windspeed                     -9939.237300   1881.86830 -5.28157965
##                                      Pr(>|t|)
## (Intercept)                      7.604432e-03
## seasonsummer                     5.066067e-02
## seasonfall                       9.075035e-05
## seasonwinter                     5.556053e-02
## yr                               9.935915e-08
## mnthFeb                          9.316001e-01
## mnthMar                          7.334826e-02
## mnthApr                          5.308403e-02
## mnthMay                          1.470734e-01
## mnthJun                          6.558348e-04
## mnthJul                          8.182975e-04
## mnthAug                          7.814617e-01
## mnthSep                          8.815548e-01
## mnthOct                          8.430752e-01
## mnthNov                          7.328498e-01
## mnthDec                          3.349306e-01
## holidayyes                       9.392048e-03
## weekdayMon                       1.376985e-01
## weekdayTue                       1.067277e-02
## weekdayWed                       4.127438e-03
## weekdayThu                       9.201658e-02
## weekdayFri                       2.515574e-03
## weekdaySat                       8.999429e-01
## weathersitMisty                  1.162704e-01
## weathersitLightPrecip            9.247167e-01
## temp                             8.373268e-06
## hum                              1.957449e-02
## windspeed                        2.937482e-06
## I(temp^2)                        3.332514e-02
## I(hum^2)                         1.075085e-01
## I(windspeed^2)                   1.834043e-05
## seasonsummer:yr                  1.512655e-06
## seasonfall:yr                    1.034351e-01
## seasonwinter:yr                  3.598810e-02
## seasonsummer:holidayyes          2.582142e-01
## seasonfall:holidayyes            7.688075e-02
## seasonsummer:weekdayMon          4.477372e-05
## seasonfall:weekdayMon            3.913686e-02
## seasonwinter:weekdayMon          1.656983e-01
## seasonsummer:weekdayTue          1.655685e-04
## seasonfall:weekdayTue            3.326191e-01
## seasonwinter:weekdayTue          9.576124e-01
## seasonsummer:weekdayWed          2.111696e-05
## seasonfall:weekdayWed            7.444669e-02
## seasonwinter:weekdayWed          2.279544e-01
## seasonsummer:weekdayThu          4.490447e-03
## seasonfall:weekdayThu            1.314060e-01
## seasonwinter:weekdayThu          5.446874e-01
## seasonsummer:weekdayFri          1.689300e-02
## seasonfall:weekdayFri            6.301736e-02
## seasonwinter:weekdayFri          1.798762e-01
## seasonsummer:weekdaySat          1.627586e-01
## seasonfall:weekdaySat            1.082977e-01
## seasonwinter:weekdaySat          2.467196e-02
## seasonsummer:weathersitMisty     2.888888e-02
## seasonfall:weathersitMisty       1.454240e-01
## seasonwinter:weathersitMisty     5.826000e-02
## seasonfall:weathersitLightPrecip 9.656799e-01
## seasonsummer:temp                3.057572e-03
## seasonfall:temp                  3.754637e-05
## seasonwinter:temp                1.489036e-03
## seasonsummer:hum                 1.748238e-02
## seasonfall:hum                   2.249598e-01
## seasonwinter:hum                 1.494752e-01
## yr:mnthFeb                       8.964883e-01
## yr:mnthMar                       4.519064e-01
## yr:mnthApr                       7.575227e-03
## yr:mnthMay                       5.925594e-06
## yr:mnthJun                       2.208904e-04
## yr:mnthJul                       4.351140e-02
## yr:mnthAug                       3.595374e-01
## yr:mnthSep                       4.041539e-01
## yr:mnthOct                       9.460306e-01
## yr:mnthNov                       1.659240e-01
## yr:mnthDec                       5.909025e-02
## yr:weekdayMon                    1.908701e-04
## yr:weekdayTue                    2.856506e-03
## yr:weekdayWed                    4.912061e-04
## yr:weekdayThu                    9.161609e-08
## yr:weekdayFri                    1.151713e-05
## yr:weekdaySat                    8.622295e-07
## yr:weathersitMisty               2.172360e-03
## yr:temp                          2.271257e-03
## yr:hum                           9.433685e-03
## yr:windspeed                     5.371271e-02
## mnthFeb:weathersitMisty          5.892673e-01
## mnthMar:weathersitMisty          5.310515e-01
## mnthApr:weathersitMisty          4.124068e-01
## mnthMay:weathersitMisty          3.441812e-01
## mnthJun:weathersitMisty          5.211807e-01
## mnthJul:weathersitMisty          5.119437e-01
## mnthAug:weathersitMisty          8.775487e-01
## mnthSep:weathersitMisty          5.552591e-01
## mnthOct:weathersitMisty          7.654073e-01
## mnthNov:weathersitMisty          2.987175e-01
## mnthDec:weathersitMisty          1.101544e-01
## mnthOct:weathersitLightPrecip    9.412226e-01
## mnthFeb:temp                     6.328682e-01
## mnthMar:temp                     2.257921e-02
## mnthApr:temp                     2.412296e-04
## mnthMay:temp                     1.373035e-02
## mnthJun:temp                     7.026802e-01
## mnthJul:temp                     5.921083e-01
## mnthAug:temp                     5.902186e-02
## mnthSep:temp                     3.848437e-03
## mnthOct:temp                     1.771322e-03
## mnthNov:temp                     5.854311e-02
## mnthDec:temp                     1.310630e-03
## mnthFeb:hum                      3.718081e-01
## mnthMar:hum                      6.442874e-01
## mnthApr:hum                      4.056292e-01
## mnthMay:hum                      1.991438e-03
## mnthJun:hum                      5.386284e-03
## mnthJul:hum                      4.824101e-02
## mnthAug:hum                      3.410881e-02
## mnthSep:hum                      1.465508e-02
## mnthOct:hum                      1.173099e-01
## mnthNov:hum                      3.130598e-01
## mnthDec:hum                      3.107617e-01
## holidayyes:hum                   4.131616e-02
## weekdayMon:weathersitMisty       1.395130e-01
## weekdayTue:weathersitMisty       6.070189e-01
## weekdayWed:weathersitMisty       7.254993e-01
## weekdayThu:weathersitMisty       8.287613e-01
## weekdayFri:weathersitMisty       2.420753e-01
## weekdaySat:weathersitMisty       2.372294e-01
## weekdayMon:weathersitLightPrecip 6.520630e-01
## weathersitMisty:temp             8.872392e-04
## weathersitLightPrecip:temp       9.347593e-01
## weathersitMisty:hum              1.408783e-03
## hum:windspeed                    1.876616e-07

It uses 151 predictors, which is quite a lot, and the p-values for many of them indicate that they may not be significant. However, most of the non-significant terms are individual levels of factor variables, which cannot easily be removed without dropping the whole factor.

plot(bike_mod_all_11$fitted.values, data_3$cnt[filter], main = "Fitted Values vs Actual", xlab = "Fitted Values", ylab = "Ground Truth", col = "darkblue", pch = 20)
abline(0, 1, col = "firebrick4", lwd = 3)

We also see that the fitted values for this model are quite close to the actual values.

summary(bike_mod_all_11)$adj.r.squared
## [1] 0.9581752

The adjusted \(R^2\) indicates that the model explains roughly 95.8% of the variance in the data.
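
For reference, the adjusted \(R^2\) discounts the plain \(R^2\) for model size, so a high value here is meaningful even with 151 predictors:

\[
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},
\]

where \(n\) is the number of observations and \(p\) the number of predictors.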

The LOOCV-RMSE of the model is 426.8036309.
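
For a linear model this quantity has a closed form and requires no refitting; a sketch of the calculation (the helper name `calc_loocv_rmse` is ours):

```r
# LOOCV-RMSE for an lm fit: the leave-one-out residual for
# observation i equals resid_i / (1 - h_i), where h_i is the
# i-th leverage (hat value), so no model refitting is needed.
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
}

calc_loocv_rmse(bike_mod_all_11)
```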

par(mfrow = c(2, 2))
plot(bike_mod_all_11, col = "dodgerblue") # diagnostic plots

The diagnostic plots all appear to be acceptable, although there are still some points with very high leverage.

sqrt(mean(bike_mod_all_11$residuals^2))
## [1] 343.8317

The model has a training RMSE of 343.83, which means that on average the model's predictions are off by roughly this many riders.

# Count the number of high VIFs in the model
sum(vif(bike_mod_all_11) > 10)
## [1] 96

This was expected due to the high number of categorical variables and their interactions. Also, since our main focus is demand prediction, we tuned the model for predictive performance at the cost of interpretability.
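
Recall that the variance inflation factor for the \(j\)-th predictor is

\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\]

where \(R_j^2\) is the \(R^2\) from regressing that predictor on all the others; values above 10 are a common rule-of-thumb threshold for problematic collinearity, and interaction terms are almost guaranteed to be collinear with their main effects.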

hist(resid(bike_mod_all_11), col = "lightslateblue", main = "Histogram of Residuals", xlab = "Residuals")

The histogram of residuals looks nearly normal, which suggests that the model's errors are approximately normally distributed and centered about 0.
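
A formal test could complement the visual check; for example (a sketch we did not run as part of the analysis):

```r
# Shapiro-Wilk test of the null hypothesis that the residuals are
# normally distributed; a large p-value would be consistent with the
# roughly normal shape of the histogram.
shapiro.test(resid(bike_mod_all_11))
```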

Over the course of this analysis we tried multiple approaches to refine our understanding of what makes a well-performing model for this data. We appear to have reached a good stopping point: the final model has both a high adjusted \(R^2\) and a low LOOCV-RMSE.



Discussion

While the final model contained a large number of predictors, making it difficult to interpret, the results indicate that it is very good at explaining the relationship between weather, season and bike sharing rentals. We tried using BIC to reduce the size of the final model; however, doing so resulted in a higher LOOCV-RMSE, so we preferred the AIC-selected model.

As expected, the weather situation is an especially important predictor, both by itself and in its interactions with other predictors, indicating that rain has a significant impact on the number of rentals, especially the interaction between light rain and windspeed.

We have some concerns about the inclusion of the year as a predictor as it may cause problems with using the model for inference on unseen data. While initially it was used as a categorical predictor, we changed it to a numeric predictor in the hopes that it would make the model generalize better to data from the future, should that ever be required. We attempted to model the data excluding year, however the general increasing ridership between the years proved to be an important factor. Future work could incorporate data from additional years to see if this trend continues.

The high adjusted \(R^2\) of the model indicates that a very large portion of the variance in the data can be explained by this model, which would make it very useful for predicting demand for bikes.

While we feel that a Poisson model would lend itself nicely to count data like this, the lack of directly comparable metrics between Poisson and normal linear models made comparison prohibitively complicated, and the excellent performance of the non-Poisson models reduced the urgency of pursuing one.
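
For reference, fitting such a model would be straightforward; the sketch below uses a simplified additive formula (not our final interaction formula) and the raw `data` frame, so it is illustrative only:

```r
# Sketch of a Poisson count model for daily ridership.
# family = poisson models log(E[cnt]) as a linear function of the
# predictors; quasipoisson would relax the equal-mean-variance
# assumption, which daily counts typically violate (overdispersion).
bike_mod_pois = glm(cnt ~ factor(season) + yr + factor(mnth) + holiday +
                      factor(weekday) + factor(weathersit) +
                      temp + hum + windspeed,
                    data = data, family = poisson)
summary(bike_mod_pois)
```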

Future Work

This dataset proved to be rich and varied, and was very interesting to work with. However, due to time constraints, we were not able to do as complete an analysis of the data as we would have liked.

While our model achieved impressive results, based on the distribution of data between demand and type of day, as well as from the outlier detection through Cook’s method, it might be suitable to build separate models for Holidays, Workdays and Weekends.
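
Such a split could be sketched as follows (the `day_type` helper and the simple formula are ours, for illustration only; the real models would need their own selection process):

```r
# Partition days into three types and fit one model per type.
day_type = ifelse(data$holiday == 1, "holiday",
                  ifelse(data$workingday == 1, "workday", "weekend"))
models = lapply(split(data, day_type), function(d)
  lm(cnt ~ temp + hum + windspeed, data = d))
sapply(models, function(m) summary(m)$adj.r.squared)
```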

Our model could also be extended to apply Seasonality analysis to see how season and days of week affect the response. As we had concerns about the inclusion of the year as a predictor given the limited amount of training data, if data for additional years is available, future work could also analyze the overall demand trend year after year after controlling for seasonal components in the data.

Finally, as was previously mentioned, we focused on accurate prediction at the expense of interpretability. Future work could attempt to find models which are more easily interpreted.



Appendix


Distribution of Target

data = read.csv("day.csv")
summary(data$cnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      22    3152    4548    4504    5956    8714
hist(data$cnt, col = "lightslateblue", main = "Histogram of Total Ridership Count", xlab = "Total Ridership Count")

It looks roughly normally distributed, although since the response is a count, it should in principle follow a Poisson distribution rather than a normal one.


Pairs plots between numeric features

pairs(data[,10:16], main = "Pairs Plot for Numeric Features")

It looks like there is a relationship between temp and cnt, although not necessarily a linear one. The other relationships are less clear.


Correlation Between Numeric Features

cor(data[,10:16])
##                  temp      atemp         hum  windspeed      casual
## temp        1.0000000  0.9917016  0.12696294 -0.1579441  0.54328466
## atemp       0.9917016  1.0000000  0.13998806 -0.1836430  0.54386369
## hum         0.1269629  0.1399881  1.00000000 -0.2484891 -0.07700788
## windspeed  -0.1579441 -0.1836430 -0.24848910  1.0000000 -0.16761335
## casual      0.5432847  0.5438637 -0.07700788 -0.1676133  1.00000000
## registered  0.5400120  0.5441918 -0.09108860 -0.2174490  0.39528245
## cnt         0.6274940  0.6310657 -0.10065856 -0.2345450  0.67280443
##            registered        cnt
## temp        0.5400120  0.6274940
## atemp       0.5441918  0.6310657
## hum        -0.0910886 -0.1006586
## windspeed  -0.2174490 -0.2345450
## casual      0.3952825  0.6728044
## registered  1.0000000  0.9455169
## cnt         0.9455169  1.0000000


Distributions of Numeric Features

summary(data[,10:13])
##       temp             atemp              hum           windspeed      
##  Min.   :0.05913   Min.   :0.07907   Min.   :0.0000   Min.   :0.02239  
##  1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200   1st Qu.:0.13495  
##  Median :0.49833   Median :0.48673   Median :0.6267   Median :0.18097  
##  Mean   :0.49538   Mean   :0.47435   Mean   :0.6279   Mean   :0.19049  
##  3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302   3rd Qu.:0.23321  
##  Max.   :0.86167   Max.   :0.84090   Max.   :0.9725   Max.   :0.50746
par(mfrow=c(2,2))
hist(data[,10], main="Temp", xlab = "Temperature (Celsius)", col = "tomato")

hist(data[,11], main="ATemp", xlab = "Normalized Temperature", col = "tomato2")

hist(data[,12], main="Humidity", xlab = "Humidity", col = "turquoise1")

hist(data[,13], main="Windspeed", xlab = "Windspeed", col = "skyblue")

These do not appear to be normally distributed, although Windspeed could possibly be made closer to normal with a transformation, and Humidity is already fairly close.


Distributions of Categorical Features

par(mfrow=c(1,3))
barplot(prop.table(table(data$season)), col = 1:4, main = "Distribution of Season", xlab = "Season", ylab = "Proportion")

barplot(prop.table(table(data$mnth)), col = 1:12,  main = "Distribution of Months", xlab = "Month", ylab = "Proportion")

barplot(prop.table(table(data$weekday)), col = 1:7,  main = "Distribution of Weekdays", xlab = "Weekday", ylab = "Proportion")

barplot(prop.table(table(data$holiday)), col = 6:7,  main = "Distribution of Holidays", xlab = "Holiday", ylab = "Proportion")

barplot(prop.table(table(data$workingday)), col = 8:9,  main = "Distribution of Working Days", xlab = "Working Day", ylab = "Proportion")

barplot(prop.table(table(data$weathersit)), col = 10:12,  main = "Distribution of Weather Types", xlab = "Weather Type", ylab = "Proportion")

It appears that Months, Seasons and Weekdays are distributed uniformly. However, there are very few holidays, more working days than non-working days and the weather seems to be mostly clear.


Categorical Features vs Response

# create a color palette (brewer.pal is from the RColorBrewer package)
library(RColorBrewer)
palette(brewer.pal(n = 12, name = "Set3"))

par(mfrow=c(1,3))
barplot(aggregate(x = data$cnt, by = list(data$mnth), FUN = sum)$x, names = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), main = "Total Ridership By Month", col = 1:12)

palette(brewer.pal(n = 4, name = "Accent"))
barplot(aggregate(x = data$cnt, by = list(data$season), FUN = sum)$x, names = c("spring", "summer", "fall", "winter"), main = "Total Ridership By Season", col = 1:4)

palette(brewer.pal(n = 4, name = "Set1"))
barplot(aggregate(x = data$cnt, by = list(data$weathersit), FUN = sum)$x, names = c("Clearish", "Misty", "LightPrecip"), main = "Total Ridership By Weather Type", col = 1:4)

There appear to be relationships between Month, Season, Weather and Total Ridership. However, Total Ridership by Weather Type closely resembles the overall distribution of Weather, so this relationship may be less informative than it first appears.

par(mfrow=c(1,3))
palette(brewer.pal(n = 4, name = "Set3"))
barplot(aggregate(x = data$cnt, by = list(data$holiday), FUN = sum)$x, names = c("No", "Yes"), main = "Total Ridership By Holiday", col = 1:12)

palette(brewer.pal(n = 4, name = "Accent"))
barplot(aggregate(x = data$cnt, by = list(data$workingday), FUN = sum)$x, names = c("No", "Yes"), main = "Total Ridership By Workingday", col = 1:4)

palette(brewer.pal(n = 7, name = "Set2"))
barplot(aggregate(x = data$cnt, by = list(data$weekday), FUN = sum)$x, names = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), main = "Total Ridership By Weekday", col = 1:7)

While ridership is higher on non-holidays and working days, these totals very closely resemble the distributions of Holidays and Working Days in the data, so these may not be very useful as predictors.

par(mfrow=c(1,3))
palette(brewer.pal(n = 6, name = "Dark2"))
plot(data$temp, data$cnt, main = "Temp vs Total Ridership", xlab = "Temp", ylab = "Total Ridership", col = 1)

plot(data$hum, data$cnt, main = "Humidity vs Total Ridership", ylab = "Total Ridership", xlab = "Humidity", col = 2)

plot(data$windspeed, data$cnt, main = "Windspeed vs Total Ridership", ylab = "Total Ridership", xlab = "Windspeed", col = 3)

It seems that there is a somewhat linear relationship between temperature and total ridership, although it may be more polynomial. It is not clear what the relationships between Humidity and Windspeed and Ridership are.

par(mfrow=c(1,3))
plot(data$windspeed, data$hum, main = "Windspeed vs Humidity", ylab = "Humidity", xlab = "Windspeed", col = 4)

plot(data$windspeed, data$temp, main = "Windspeed vs Temp", ylab = "Temp", xlab = "Windspeed", col = 5)

plot(data$hum, data$temp, main = "Humidity vs Temp", ylab = "Temp", xlab = "Humidity", col = 6)

It seems as if these numerical features may be related somehow, although they do not seem to be related linearly.


Finding Interactions

First we will look at box plots of Total Ridership by the categorical features to see if anything stands out as worth exploring:

palette(brewer.pal(n = 12, name = "Set3"))
par(mfrow = c(1,2))
boxplot(cnt ~ weathersit, data = data, col = 1:4, main = "Count by Weather", ylab = "Total Ridership")
boxplot(cnt ~ season, data = data, col = 5:8, main = "Count by Season", ylab = "Total Ridership")

boxplot(cnt ~ mnth, data = data, col = 1:12, main = "Count by Month", ylab = "Total Ridership")
boxplot(cnt ~ weekday, data = data, col = 1:7, main = "Count by Weekday", ylab = "Total Ridership")

boxplot(cnt ~ workingday, data = data, col = 8:10, main = "Count by Working Day", ylab = "Total Ridership")
boxplot(cnt ~ holiday, data = data, col = 10:12, main = "Count by Holiday", ylab = "Total Ridership")

It appears that Weather, Season and Month may be worth further exploration, Weekday and Working Day don’t seem to be of much interest, and while the mean Ridership on Holidays is lower than on non-Holidays, it is not significantly lower.

We will now look at some interactions between the categorical features we identified above and some numerical features.

par(mfrow=c(1,2))
palette(brewer.pal(n = 4, name = "Set1"))
plot(data$temp, data$cnt, col = data$season, pch = 20, main = "Ridership vs Temp by Season", xlab = "Temperature", ylab = "Ridership")
legend("topleft", legend = c("Spring", "Summer", "Fall", "Winter"), col = 1:4, pch = 20)

plot(data$hum, data$cnt, col = data$season, pch = 20, main = "Ridership vs Humidity by Season", xlab = "Humidity", ylab = "Ridership")
legend("topleft", legend = c("Spring", "Summer", "Fall", "Winter"), col = 1:4, pch = 20)

As we would expect temperature to be correlated with season, the Ridership vs Temperature plot doesn't show any surprising patterns. However, the Ridership vs Humidity plot has vertical clusters which could indicate an interesting interaction.

par(mfrow=c(1,2))
palette(brewer.pal(n = 4, name = "Set2"))
plot(data$temp, data$cnt, col = data$weathersit, pch = 20, main = "Ridership vs Temp by Weather", xlab = "Temperature", ylab = "Ridership")
legend("topleft", legend = c("Clear", "Misty", "Light Precip"), col = 1:3, pch = 20)

plot(data$hum, data$cnt, col = data$weathersit, pch = 20, main = "Ridership vs Humidity by Weather", xlab = "Humidity", ylab = "Ridership")
legend("topleft", legend = c("Clear", "Misty", "Light Precip"), col = 1:3, pch = 20)

Similarly, as we would expect Humidity to be correlated with Weather, the Ridership vs Humidity plot doesn't show any surprising patterns, while the Ridership vs Temperature plot indicates that there may be some interaction worth exploring.

par(mfrow = c(1,2))
palette(brewer.pal(n = 12, name = "Set3"))
plot(data$temp, data$cnt, col = data$mnth, pch = 20, main = "Ridership vs Temp by Month", xlab = "Temp", ylab = "Ridership")

plot(data$hum, data$cnt, col = data$mnth, pch = 20, main = "Ridership vs Humidity by Month", xlab = "Humidity", ylab = "Ridership")
legend("topleft", legend = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), col = 1:12, lwd = 1, lty = c(0,0), pch = 20)

It is difficult to draw firm conclusions from the plots above; however, both interactions may merit further investigation.